Guessing lexicon entries using finite-state methods

Author

  • Kimmo Koskenniemi
Abstract

A practical method for interactive guessing of LEXC lexicon entries is presented. The method is based on describing groups of similarly inflected words using regular expressions. The patterns are compiled into a finite-state transducer (FST) which maps any word form into the possible LEXC lexicon entries which could generate it. The same FST can be used (1) for converting conventional headword lists into LEXC entries, (2) for interactive guessing of entries, (3) for corpus-assisted interactive guessing and (4) for guessing entries from corpora. A method of representing affixes as a table is also presented, together with a way of converting the tables into LEXC format for several different purposes, including morphological analysis and entry guessing. The method has been implemented using the HFST finite-state transducer tools and their Python embedding, plus a number of small Python scripts for conversions. The method is tested with a near-complete implementation of Finnish verbs. An experiment in generating Finnish verb entries from corpus data is also described, as well as the creation of a full-scale analyzer for Finnish verbs using the conversion patterns.
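The core idea of the abstract, patterns that map a word form to every lexicon entry that could generate it, can be illustrated with a minimal sketch. It is not the paper's implementation: plain Python regular expressions stand in for the compiled FST, the two patterns are toy rules written for this example, and the continuation class names V_SANOA and V_SAADA are hypothetical.

```python
import re

# Hypothetical toy patterns, NOT the paper's actual rules. Each pair is
# (regex over a surface form, template for a LEXC-style entry line).
# The continuation class names V_SANOA and V_SAADA are invented here.
PATTERNS = [
    (re.compile(r"(?P<stem>\w+[ou])(?:a|n|t|mme|tte|vat)"), "{stem} V_SANOA ;"),
    (re.compile(r"(?P<stem>\w*(?:aa|oo|uu|yy))(?:da|n|t|mme|tte)"), "{stem} V_SAADA ;"),
]


def guess(form):
    """Return every candidate entry whose pattern accepts the whole form.

    Unlike a real FST lookup, fullmatch yields at most one analysis
    per pattern, which is enough for this illustration.
    """
    candidates = set()
    for pattern, template in PATTERNS:
        match = pattern.fullmatch(form)
        if match:
            candidates.add(template.format(stem=match.group("stem")))
    return candidates


def guess_from_forms(forms):
    """Keep only the candidates that are consistent with all given forms."""
    sets = [guess(form) for form in forms]
    return set.intersection(*sets) if sets else set()
```

Under these toy patterns, `guess("suun")` yields two candidate entries, and `guess_from_forms(["suun", "suuda"])` narrows them to one. This is the kind of narrowing that additional forms, typed interactively or found in a corpus, could provide.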

Similar resources

Proto-Indo-European Lexicon: The Generative Etymological Dictionary of Indo-European Languages

Proto-Indo-European Lexicon (PIE Lexicon) is the generative etymological dictionary of Indo-European languages. The reconstruction of Proto-Indo-European (PIE) is obtained by applying the comparative method, the output of which equals the Indo-European (IE) data. Due to this, the Indo-European sound laws leading from PIE to IE, revised in Pyysalo 2013, can be coded using Finite-State Transducers...

Finite Automata and Efficient Lexicon Implementation

We describe a general technique for the encoding of lexical functions (such as lexical classification, gender and number marking, inflections and conjugations) using minimized acyclic finite-state automata. This technique has been used to store a Portuguese lexicon with over 2 million entries in about 1 megabyte. Unlike general file compression schemes, this representation allows random access to t...

Using Lexical Similarity in Handwritten Word Recognition

Recognition using only visual evidence cannot always be successful due to limitations of the information and resources available during training. Considering relations among lexicon entries is sometimes useful for decision making. In this paper, we present a method to capture lexical similarity of a lexicon and reliability of a character recognizer which serve to capture the dynamism of the environm...

Learning Compact Lexicons for CCG Semantic Parsing

We present methods to control the lexicon size when learning a Combinatory Categorial Grammar semantic parser. Existing methods incrementally expand the lexicon by greedily adding entries, considering a single training datapoint at a time. We propose using corpus-level statistics for lexicon learning decisions. We introduce voting to globally consider adding entries to the lexicon, and pruning ...

SWAP paper THE ROLE OF PERCEPTUAL EPISODES IN LEXICAL PROCESSING

Nearly all theories of spoken word perception presume a lexicon with singular entries corresponding to each word. In turn, the perceptual system is presumed to operate by matching entries to the variable signals that speakers produce, requiring either normalization or sophisticated guessing. In contrast, episodic theories assume that people store multiple entries, in the form of detailed percep...

Publication date: 2018